🏣 Marking + Grades + Due date

🏗️Instructions

The data files are located inside the data folder in your assignment R projects.

To complete the assignment, you will need to fill in the blanks with the appropriate function names, arguments, or other names. These sections are marked with ___. At a minimum, your assignment should be able to be “knitted” using the Knit button for your Rmarkdown document.

📑 Report

Introduction

The data that we are going to study in this assignment is about the Olympics games. The data span from the first Olympics held in Athens in 1986 to the 2016 Olympics held in Rio de Janeiro. We are going to run some modeling on this data. The purpose of this assignment is to show you the typical tasks performed by a data analyst from analyzing data to performing modeling to gain insights.

There are 2 data files in this assignment:
1. athlete_events.csv: The individual athlete competing in the Olympic events.
2. noc_regions.csv: National Olympic Committee with its country name.

The variables in athlete_events.csv are:

  1. ID
  2. Name
  3. Sex - Female, Male
  4. Age
  5. Height — In centimeters
  6. Weight — In kilograms
  7. Team — Team name
  8. NOC — National Olympic Committee (3-letter code)
  9. Games — Year and season
  10. Year — Integer
  11. Season — Summer, Winter
  12. City — Host city
  13. Sport — Sport
  14. Event — Event
  15. Medal — Gold, Silver, Bronze, or NA.

Section A: Data Preparation

1. Load all the essential packages here. [Hint: There is about 10 here] [3m]

# all packages here
library(tidyverse)
library(rvest)
library(naniar)
library(plotly)
library(ggrepel)
library(readxl)
library(janitor)
library(polite)
library(ggResidpanel)
library(broom)

2. Import the athlete_events data. We will also remove the 1906 Olympic Games which is not recognized by the International Olympic Committee. [3m]

data_raw <- read_csv("data/athlete_events.csv") 
## Rows: 271116 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (10): Name, Sex, Team, NOC, Games, Season, City, Sport, Event, Medal
## dbl  (5): ID, Age, Height, Weight, Year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data <- data_raw %>% 
  filter(Year != "1906") #2m

3. Get the full region name of NOC (National Olympic Committees) and merge it with the athlete data set. [hint: use the noc_regions.csv and notes column is not needed.] [4m]

noc <- read_csv("data/noc_regions.csv")
## Rows: 230 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): NOC, region, notes
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
data_noc <- data %>% 
    left_join(noc %>% select(NOC, region),
              by = "NOC")

head(data_noc)
## # A tibble: 6 × 16
##      ID Name    Sex     Age Height Weight Team   NOC   Games   Year Season City 
##   <dbl> <chr>   <chr> <dbl>  <dbl>  <dbl> <chr>  <chr> <chr>  <dbl> <chr>  <chr>
## 1     1 A Diji… M        24    180     80 China  CHN   1992 …  1992 Summer Barc…
## 2     2 A Lamu… M        23    170     60 China  CHN   2012 …  2012 Summer Lond…
## 3     3 Gunnar… M        24     NA     NA Denma… DEN   1920 …  1920 Summer Antw…
## 4     4 Edgar … M        34     NA     NA Denma… DEN   1900 …  1900 Summer Paris
## 5     5 Christ… F        21    185     82 Nethe… NED   1988 …  1988 Winter Calg…
## 6     5 Christ… F        21    185     82 Nethe… NED   1988 …  1988 Winter Calg…
## # … with 4 more variables: Sport <chr>, Event <chr>, Medal <chr>, region <chr>

4. Scrap the data about the host countries from the web (https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities). The objective of this is to get the name of the host country of each Olympics, which will later be merged together with the data from the athlete_events data set. The output should have information on the City, Country, Year and Continent. [3m]

hostlink <- read_html("https://en.wikipedia.org/wiki/List_of_Olympic_Games_host_cities")

host <- hostlink %>% 
            html_nodes("table:nth-child(19)") %>% #2m
            html_table() %>% 
            `[[` (1) %>% # to extract the first element from the list
            select(-1) # delete first column as the column of City is duplicated.

head(host)
## # A tibble: 6 × 9
##   City         Country  Year Continent Summer Winter `Opening ceremo… `Closing ceremo…
##   <chr>        <chr>   <int> <chr>     <chr>  <chr>  <chr>            <chr>           
## 1 Athens       Greece   1896 Europe    S005I  ""     6 April 1896     15 April 1896   
## 2 Paris        France   1900 Europe    S005II ""     14 May 1900      28 October 1900 
## 3 St. Louis[a] United…  1904 North Am… S005I… ""     1 July 1904      23 November 1904
## 4 London[b]    United…  1908 Europe    S005IV ""     27 April 1908    31 October 1908 
## 5 Stockholm    Sweden   1912 Europe    S005V  ""     6 July 1912      22 July 1912    
## 6 Berlin       Germany  1916 Europe    S006VI ""     Cancelled due t… Cancelled due t…
## # … with 1 more variable:
## #   .mw-parser-output .tooltip-dotted{border-bottom:1px dotted;cursor:help}Ref <chr>

5. Merge the host table with the data_noc table. [Note: some years might have 2 hosting cities)] [4m]

data_complete <- data_noc %>% 
    left_join(host %>% select(Year, City, Country), # only select relevant columns
              by = c("Year","City"))

head(data_complete)
## # A tibble: 6 × 17
##      ID Name    Sex     Age Height Weight Team   NOC   Games   Year Season City 
##   <dbl> <chr>   <chr> <dbl>  <dbl>  <dbl> <chr>  <chr> <chr>  <dbl> <chr>  <chr>
## 1     1 A Diji… M        24    180     80 China  CHN   1992 …  1992 Summer Barc…
## 2     2 A Lamu… M        23    170     60 China  CHN   2012 …  2012 Summer Lond…
## 3     3 Gunnar… M        24     NA     NA Denma… DEN   1920 …  1920 Summer Antw…
## 4     4 Edgar … M        34     NA     NA Denma… DEN   1900 …  1900 Summer Paris
## 5     5 Christ… F        21    185     82 Nethe… NED   1988 …  1988 Winter Calg…
## 6     5 Christ… F        21    185     82 Nethe… NED   1988 …  1988 Winter Calg…
## # … with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>, region <chr>,
## #   Country <chr>

6. There are some missing values in the variable Country. Find out which cities and the years have missing values of Country. [hint: The table should not have any duplicate values. You might want to look up some new function for this which we have not taught in class.] [3m]

data_complete %>% 
   filter(is.na(Country)) %>% #2m
    select(Year, City, Country) %>% 
    distinct() # the function that get rid of duplicated values
## # A tibble: 15 × 3
##     Year City         Country
##    <dbl> <chr>        <chr>  
##  1  1920 Antwerpen    <NA>   
##  2  2006 Torino       <NA>   
##  3  2008 Beijing      <NA>   
##  4  2004 Athina       <NA>   
##  5  1960 Squaw Valley <NA>   
##  6  1956 Melbourne    <NA>   
##  7  1960 Roma         <NA>   
##  8  1980 Moskva       <NA>   
##  9  1976 Innsbruck    <NA>   
## 10  1904 St. Louis    <NA>   
## 11  1928 Sankt Moritz <NA>   
## 12  1908 London       <NA>   
## 13  1948 Sankt Moritz <NA>   
## 14  1896 Athina       <NA>   
## 15  1956 Stockholm    <NA>

Now that we know which City (and Year) have issues, we have to check the wiki table and make edits to our dataset. The city name from the wiki table contains some footnote numbering. I have identified those cases for you as below: (The left is from the wiki table, the right is from the ahtlete_events data)

  1. Athens = Athina
  2. St. Louis[a] = St. Louis
  3. London[b] = London
  4. Antwerp[c] = Antwerpen
  5. St. Moritz = Sankt Moritz
  6. MelbourneStockholm = Melbourne [note: Two cities are hosted country, change it to Melbourne and add extra rows for Stockholm as below.]
  7. Rome = Roma
  8. Innsbruck[g] = Innsbruck
  9. Moscow = Moskva
  10. Turin = Torino
  11. Beijing[i] = Beijing

Add an extra row for Year (1956), City (Stockholm), Country (Sweden).
Also, change the corresponding row of MelbourneStockholm’s host country to just Australia.

(Phew! it is a lot of works isn’t it, welcome to the real life of data analyst! 😓)

I have done it for you here, as below. Please try to understand the code below as they are very useful functions!

host_c <- host %>% 
   mutate(City = case_when(City == "Athens" ~ "Athina",
                         City == "St. Louis[a]" ~ "St. Louis",
                         City == "London[b]" ~ "London",
                         City == "Antwerp[c]" ~ "Antwerpen",
                         City == "St. Moritz" ~ "Sankt Moritz",
                         City == "MelbourneStockholm[f]" ~ "Melbourne",
                         City == "Rome" ~ "Roma",
                         City == "Innsbruck[g]" ~ "Innsbruck",
                         City == "Moscow" ~ "Moskva",
                         City == "Turin" ~ "Torino",
                         City == "Beijing[i]" ~ "Beijing",
                         City == "Palisades Tahoe, then called Squaw Valley" ~ "Squaw Valley",
                         TRUE ~ City)) %>% 
    add_row(City = "Stockholm", Year = 1956, Country = "Sweden") %>% 
    mutate_at("Country", str_replace, "Australia\\sSweden", "Australia" )
# the "\s" is because of regular expression of a space

7. Now, merge the data again and check whether there is any missing value in the Country variable. [6m]

final_data <- data_noc %>% 
    left_join(host_c %>% select(Year, City, Country), # only select relevant columns
              by = c("Year","City")) #2m

head(final_data)
## # A tibble: 6 × 17
##      ID Name    Sex     Age Height Weight Team   NOC   Games   Year Season City 
##   <dbl> <chr>   <chr> <dbl>  <dbl>  <dbl> <chr>  <chr> <chr>  <dbl> <chr>  <chr>
## 1     1 A Diji… M        24    180     80 China  CHN   1992 …  1992 Summer Barc…
## 2     2 A Lamu… M        23    170     60 China  CHN   2012 …  2012 Summer Lond…
## 3     3 Gunnar… M        24     NA     NA Denma… DEN   1920 …  1920 Summer Antw…
## 4     4 Edgar … M        34     NA     NA Denma… DEN   1900 …  1900 Summer Paris
## 5     5 Christ… F        21    185     82 Nethe… NED   1988 …  1988 Winter Calg…
## 6     5 Christ… F        21    185     82 Nethe… NED   1988 …  1988 Winter Calg…
## # … with 5 more variables: Sport <chr>, Event <chr>, Medal <chr>, region <chr>,
## #   Country <chr>
any_na(final_data$Country)
## [1] FALSE

Now, the data is in good shape and we are ready to explore the data in more detail.

Section B: Exploring Missing Values

8. What is the percentage of missing values in final_data? [Inline code is required. [1m +1m]

pct_miss(final_data)
## [1] 7.843654

The percentage of missing values is 7.843654%.

9. Construct an appropriate plot to show the co-occurrence of missing values across variables. [1m]

gg_miss_upset(final_data) 

10. Based on your plot in question 8, answer the following:

10a. How many observations are missing only for Medal? Does it raise a concern? [Do not need inline coding.] [2m]

There are 175591 observations missing only for ‘Medal’. These are high numbers for missing observations and so it does raise a concern of having these many missing observations, that too only for the Medal column. It will increase the percentage of missing values for the entire table, but the possible reason is it maybe due to no medal earned for the particular candidate and then lead to reduced in statistical power and representative of the samples.

10b. What is the percentage of missing values for Height and Weight only regardless of whether they have won a medal? [Do not need inline coding, but if you wish you can have it here.] [2m+1m]

# coding here
pct_miss_HW <- final_data %>%
  select(Height,Weight) %>%
  miss_case_table()

The total percentage of missing for Height and Weight regardless of they have won the medal is 21.2745422%.

11. Are the Olympics games authorities paying more attention towards data collection over the years? (Focus on only the Summer games). Do you observe an increase in data completeness? [hint: Visualize the data completeness for different variables over the years.] What do you learn from this graph? Which variables contain the most missing values? Why there is a gap in the plot? Reproduce the plot below and answer the question. [8m]

# many lines of coding here!
final_data %>%
  filter(Season == "Summer") %>%
  select(!c("ID","Medal")) %>%
  gg_miss_fct(fct = Year) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_x_continuous(breaks=seq(1896, 2016, 4))

Indeed. The Olympics games authorities are paying more attention towards the collection of data over the years. From the graph, it is proven that the data is complete as the variables like height, weight and the age are getting better over the time duration. The ‘Medal’ column has the most number of missing values. There’s a gap between the years 1912- 1920 and 1936-1944 approximately. This might be because of the World war-1 and world war-2.

12. How would you deal with the missing values? Is the missing is at random? [2m]

No, the missing is not at random. What year the “height” is missing, then that year the “Weight” seems to be missing as well.

Some way to treat missing value: 1. Deleting the observations 2. Deleting the variable 3. Imputation with mean / median / mode 4. Prediction of missing values.

For example,for “Weight”, we can estimate a model, then using the model to predict some values to fill in to replace NAs.

Section C: Data Exploration

13. Count the number of times a country has hosted Olympics games using the host_c data and construct a relevant plot (in descending order of the number of times countries hosted the Olympics). [4m]

host_c %>% 
  count(Country)%>% 
    ggplot(aes(x = reorder(Country, -n),
               y = n)) + 
    geom_col() +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    labs(y = "Number of hosting",
         x = "",
         title = "Countries that hosted Olympics from 1956-2016") 

14. How many athletes took part each year in each Season? The table should have 3 columns which are Year, Summer and Winter. Sort the data with the highest value for Summer. [5m]

final_data %>% 
    count(Year, Season) %>%
    pivot_wider(id_cols = Year,
                names_from = Season,
                values_from = n) %>% 
    arrange(desc(Summer))
## # A tibble: 34 × 3
##     Year Summer Winter
##    <dbl>  <int>  <int>
##  1  2000  13821     NA
##  2  1996  13780     NA
##  3  2016  13688     NA
##  4  2008  13602     NA
##  5  2004  13443     NA
##  6  1992  12977   3436
##  7  2012  12920     NA
##  8  1988  12037   2639
##  9  1972  10304   1655
## 10  1984   9454   2134
## # … with 24 more rows

15. Plot an appropriate graph to show the trend of the number of athletes involved in both seasons. Rename the y-axis. [3m]

final_data %>% 
    count(Year, Season) %>% # no mark, same as question 2
    ggplot(aes(x = Year,
               y = n, 
               colour = Season)) +
    geom_line() +
    labs(y = "Number of Athletes") -> q12

q12

16. Find out why there is a peak and trough in the number of athletes by constructing an interactive plot. Which games show dips in the trend? List 3 of them (in years). (no inline code is required.) [1m+2m]

ggplotly(q12)

I observed that when focus on Summer games, there are three dips in 1932 with 2969 athletes, 1956 with 5127 athletes and 1980 with 7191 athletes.

17. Add geom_text to show the name of the host city of 3 games that you mentioned in the previous question in the graph (focus only on the Summer games). Find out why there are two Cities for holding the Olympic games. Write a small paragraph about it. [4m + 2m]

dip <- final_data %>% 
          filter(Year %in% c(1932,1956,1980) & Season == "Summer")  %>% #3m
          count(Year, Season, City)
          
q12 +
    geom_text(aes(label = City), colour = "blue", data = dip) # Ask among your group member why the colour argument is not placed inside the aesthetics?

From the graph shown, in 1956 there are two hosting cities Melbourne and Stockholm. This was the first time the Olympics would be held in the Southern Hemisphere and Oceania, and it also marked the first occasion that the Games were played outside of Europe and North America. However, in the lead up to the Games, there were a series of boycotts, political problems, and controversy. Australian equine quarantine laws prevented the equestrian events from taking place in Melbourne. Therefore, the equestrian events at the 1956 Summer Olympics were held in Stockholm and included dressage, eventing, and show jumping.

18. Did the number of events in the Olympic games change over time? Construct a relevant plot showing a trend each for the Summer and Winter games. [2m]

data %>% 
    group_by(Year, Season) %>%
    summarise(n_event = length(unique(Event))) %>% 
    ggplot(aes(x = Year, 
               y = n_event,
               colour = Season)) +
    geom_point() +
    geom_line() + # show the trend
    labs(y = "Number of events",
         title = "Increasing number of events in the Olympics games") +
    theme_minimal()
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.

From here onwards, we will focus on just the Summer games.

final_data2 <- final_data %>% 
    filter(Season == "Summer")

19. Do countries win more medals if they are the hosts? According to research (Clarke, 2000), the host countries will win more medals due to several factors such as home-field advantage, crowd support, and higher participation rate. Let’s just focus on Australia and see if the data are consistent with the research by Clarke.

19a. Count the number of medals that Australia has won since 1896. [4m]

medals_won <- final_data2 %>% 
    filter(region == "Australia") %>% 
    select(Games, Year, Medal, Sport, Event, Country, City) %>% 
    distinct() %>% 
    group_by(Games, Year, City, Country) %>% 
    summarise(total_medals = sum(n_complete(Medal), na.rm = TRUE)) 
## `summarise()` has grouped output by 'Games', 'Year', 'City'. You can override using the `.groups` argument.
medals_won
## # A tibble: 29 × 5
## # Groups:   Games, Year, City [29]
##    Games        Year City        Country        total_medals
##    <chr>       <dbl> <chr>       <chr>                 <int>
##  1 1896 Summer  1896 Athina      Greece                    3
##  2 1900 Summer  1900 Paris       France                    6
##  3 1904 Summer  1904 St. Louis   United States             4
##  4 1908 Summer  1908 London      United Kingdom            5
##  5 1912 Summer  1912 Stockholm   Sweden                    7
##  6 1920 Summer  1920 Antwerpen   Belgium                   3
##  7 1924 Summer  1924 Paris       France                    6
##  8 1928 Summer  1928 Amsterdam   Netherlands               4
##  9 1932 Summer  1932 Los Angeles United States             5
## 10 1936 Summer  1936 Berlin      Germany                   1
## # … with 19 more rows

19b. In which year(s) did Australia host the Olympic games? [Inline coding is needed.] [2m+2m]

yearhost <- medals_won %>%  
    filter(Country == "Australia")

yearhost
## # A tibble: 2 × 5
## # Groups:   Games, Year, City [2]
##   Games        Year City      Country   total_medals
##   <chr>       <dbl> <chr>     <chr>            <int>
## 1 1956 Summer  1956 Melbourne Australia           35
## 2 2000 Summer  2000 Sydney    Australia           58

In year 1956 and 2000Australia host the Olympic game.

19c. Combine 17(a) and 17(b) and construct a line plot to show the number of medals won by Australia. Circle and show the name of the host city of the two data points which correspond to the years when Australia hosted the games. [4m]

medals_won %>% 
    ggplot(aes(x = Year,
               y = total_medals)) + 
    geom_line(aes()) +
    geom_point(shape = 1, size = 3, colour = "red",  data = yearhost) +
    geom_label_repel(aes(label = City), color = "blue", data = yearhost) +
    labs(y = "Total Medals",
         title = "Number of medals won by Australia at Olympic games")

20. Is there an equal participation rate (in number) between male and female athletes over time? Count the numbers and plot the data. What do you observe? [4m +1m]

n_sex <- final_data2 %>% 
    group_by(Year, Sex) %>% 
    summarize(n_ath = sum(n())) # you might want to fill up multiple function here #3m
## `summarise()` has grouped output by 'Year'. You can override using the `.groups` argument.
n_sex
## # A tibble: 55 × 3
## # Groups:   Year [28]
##     Year Sex   n_ath
##    <dbl> <chr> <int>
##  1  1896 M       380
##  2  1900 F        33
##  3  1900 M      1903
##  4  1904 F        16
##  5  1904 M      1285
##  6  1908 F        47
##  7  1908 M      3054
##  8  1912 F        87
##  9  1912 M      3953
## 10  1920 F       134
## # … with 45 more rows
ggplot(n_sex, aes(x = Year, y =n_ath,
                  group = Sex, colour = Sex)) +
    geom_col(position = "fill")

I observed that participation rate between male and female are not equal, male participants are always more than female over the time from 1890 till now. But looking at the graph, female participants isincreased while male participants is decreased over the time , so it is unsure to say the male participants will still stay more than female in later years.

21. Who are the Top 20 (based on the total number of medals won) athletes in the History of the Olympic Games? [4m]

final_data2 %>% 
  group_by(Name, Team) %>% 
  summarize(n_medal = n_complete(Medal)) %>% # multiple functions are allowed here #2m
  arrange(desc(n_medal)) %>% 
  top_n(20)
## `summarise()` has grouped output by 'Name'. You can override using the `.groups` argument.
## Selecting by n_medal
## # A tibble: 118,667 × 3
## # Groups:   Name [115,524]
##    Name                                                Team          n_medal
##    <chr>                                               <chr>           <int>
##  1 "Michael Fred Phelps, II"                           United States      28
##  2 "Larysa Semenivna Latynina (Diriy-)"                Soviet Union       18
##  3 "Nikolay Yefimovich Andrianov"                      Soviet Union       15
##  4 "Borys Anfiyanovych Shakhlin"                       Soviet Union       13
##  5 "Edoardo Mangiarotti"                               Italy              13
##  6 "Takashi Ono"                                       Japan              13
##  7 "Aleksey Yuryevich Nemov"                           Russia             12
##  8 "Dara Grace Torres (-Hoffman, -Minas)"              United States      12
##  9 "Jennifer Elisabeth \"Jenny\" Thompson (-Cumpelik)" United States      12
## 10 "Natalie Anne Coughlin (-Hall)"                     United States      12
## # … with 118,657 more rows

There are a lot more analysis that you can do, but I will stop here. Please explore the data on your own time 😇.

Section D: Modelling

Let’s predict the number of medals won!

According to research, country GDP, population, home field advantage are the most common factors in predicting the medal won by the country.

22. Import the GDP and the Population data from the data folder. [3m]

world_pop <- read_excel("data/world_pop.xlsx")
world_gdp <- read_excel("data/world_gdp.xlsx", range = cell_limits(c(4,1),c(NA,NA)))

head(world_pop)
## # A tibble: 6 × 61
##   Country  `Country Code` `Indicator Name` `Indicator Code` `1960` `1961` `1962`
##   <chr>    <chr>          <chr>            <chr>             <dbl>  <dbl>  <dbl>
## 1 Aruba    ABW            Population, tot… SP.POP.TOTL      5.42e4 5.54e4 5.62e4
## 2 Afghani… AFG            Population, tot… SP.POP.TOTL      9.00e6 9.17e6 9.35e6
## 3 Angola   AGO            Population, tot… SP.POP.TOTL      5.64e6 5.75e6 5.87e6
## 4 Albania  ALB            Population, tot… SP.POP.TOTL      1.61e6 1.66e6 1.71e6
## 5 Andorra  AND            Population, tot… SP.POP.TOTL      1.34e4 1.44e4 1.54e4
## 6 UAE      ARE            Population, tot… SP.POP.TOTL      9.26e4 1.01e5 1.12e5
## # … with 54 more variables: 1963 <dbl>, 1964 <dbl>, 1965 <dbl>, 1966 <dbl>,
## #   1967 <dbl>, 1968 <dbl>, 1969 <dbl>, 1970 <dbl>, 1971 <dbl>, 1972 <dbl>,
## #   1973 <dbl>, 1974 <dbl>, 1975 <dbl>, 1976 <dbl>, 1977 <dbl>, 1978 <dbl>,
## #   1979 <dbl>, 1980 <dbl>, 1981 <dbl>, 1982 <dbl>, 1983 <dbl>, 1984 <dbl>,
## #   1985 <dbl>, 1986 <dbl>, 1987 <dbl>, 1988 <dbl>, 1989 <dbl>, 1990 <dbl>,
## #   1991 <dbl>, 1992 <dbl>, 1993 <dbl>, 1994 <dbl>, 1995 <dbl>, 1996 <dbl>,
## #   1997 <dbl>, 1998 <dbl>, 1999 <dbl>, 2000 <dbl>, 2001 <dbl>, 2002 <dbl>, …
head(world_gdp)
## # A tibble: 6 × 61
##   `Country Name` `Country Code` `Indicator Name`  `Indicator Code`     `1960`
##   <chr>          <chr>          <chr>             <chr>                 <dbl>
## 1 Afghanistan    AFG            GDP (current US$) NY.GDP.MKTP.CD    537777811
## 2 Albania        ALB            GDP (current US$) NY.GDP.MKTP.CD           NA
## 3 Algeria        DZA            GDP (current US$) NY.GDP.MKTP.CD   2723648552
## 4 American Samoa ASM            GDP (current US$) NY.GDP.MKTP.CD           NA
## 5 Andorra        AND            GDP (current US$) NY.GDP.MKTP.CD           NA
## 6 Angola         AGO            GDP (current US$) NY.GDP.MKTP.CD           NA
## # … with 56 more variables: 1961 <dbl>, 1962 <dbl>, 1963 <dbl>, 1964 <dbl>,
## #   1965 <dbl>, 1966 <dbl>, 1967 <dbl>, 1968 <dbl>, 1969 <dbl>, 1970 <dbl>,
## #   1971 <dbl>, 1972 <dbl>, 1973 <dbl>, 1974 <dbl>, 1975 <dbl>, 1976 <dbl>,
## #   1977 <dbl>, 1978 <dbl>, 1979 <dbl>, 1980 <dbl>, 1981 <dbl>, 1982 <dbl>,
## #   1983 <dbl>, 1984 <dbl>, 1985 <dbl>, 1986 <dbl>, 1987 <dbl>, 1988 <dbl>,
## #   1989 <dbl>, 1990 <dbl>, 1991 <dbl>, 1992 <dbl>, 1993 <dbl>, 1994 <dbl>,
## #   1995 <dbl>, 1996 <dbl>, 1997 <dbl>, 1998 <dbl>, 1999 <dbl>, 2000 <dbl>, …

23. Unfortunately, the years of the data are in columns. So let’s change them to be displayed in rows. The new column should be named Year, the column for population should be named pop and the column for gdp should be named gdp. [6m]

world_pop1 <- world_pop %>% 
               pivot_longer(cols = -c("Country","Country Code", "Indicator Name", "Indicator Code"),
                   names_to = "Year",
                   values_to = "pop") %>% 
               clean_names() %>% 
               mutate(year = as.numeric(year))
  
head(world_pop1)  
## # A tibble: 6 × 6
##   country country_code indicator_name    indicator_code  year   pop
##   <chr>   <chr>        <chr>             <chr>          <dbl> <dbl>
## 1 Aruba   ABW          Population, total SP.POP.TOTL     1960 54211
## 2 Aruba   ABW          Population, total SP.POP.TOTL     1961 55438
## 3 Aruba   ABW          Population, total SP.POP.TOTL     1962 56225
## 4 Aruba   ABW          Population, total SP.POP.TOTL     1963 56695
## 5 Aruba   ABW          Population, total SP.POP.TOTL     1964 57032
## 6 Aruba   ABW          Population, total SP.POP.TOTL     1965 57360
world_gdp1 <- world_gdp %>% 
               pivot_longer(cols = -c("Country Name","Country Code", "Indicator Name", "Indicator Code"),
                   names_to = "Year",
                   values_to = "gdp") %>% 
               clean_names() %>% 
               mutate(year = as.numeric(year))

head(world_gdp1) 
## # A tibble: 6 × 6
##   country_name country_code indicator_name    indicator_code  year        gdp
##   <chr>        <chr>        <chr>             <chr>          <dbl>      <dbl>
## 1 Afghanistan  AFG          GDP (current US$) NY.GDP.MKTP.CD  1960  537777811
## 2 Afghanistan  AFG          GDP (current US$) NY.GDP.MKTP.CD  1961  548888896
## 3 Afghanistan  AFG          GDP (current US$) NY.GDP.MKTP.CD  1962  546666678
## 4 Afghanistan  AFG          GDP (current US$) NY.GDP.MKTP.CD  1963  751111191
## 5 Afghanistan  AFG          GDP (current US$) NY.GDP.MKTP.CD  1964  800000044
## 6 Afghanistan  AFG          GDP (current US$) NY.GDP.MKTP.CD  1965 1006666638

24. Now, let’s consider only the data from 1992 (inclusive). Create a column homefield, indicating 1 if the Team is the same as the hosted Country variable, else 0. Calculate the total medals won by each Team. [6m]

final_data2 %>% 
    filter(Year %in% c(1992:2016)) %>% 
    mutate(homefield = ifelse(Team == Country, 1, 0)) %>% 
    select(Games, Year, Medal, Sport, Event, Country, City, Team, homefield) %>% 
    distinct() %>% 
    group_by(Year, Country, Team, homefield) %>% 
    summarise(total_medals = sum(n_complete(Medal), na.rm = TRUE)) -> final
## `summarise()` has grouped output by 'Year', 'Country', 'Team'. You can override using the `.groups` argument.
head(final)
## # A tibble: 6 × 5
## # Groups:   Year, Country, Team [6]
##    Year Country Team                homefield total_medals
##   <dbl> <chr>   <chr>                   <dbl>        <int>
## 1  1992 Spain   Albania                     0            0
## 2  1992 Spain   Algeria                     0            2
## 3  1992 Spain   American Samoa              0            0
## 4  1992 Spain   Andorra                     0            0
## 5  1992 Spain   Angola                      0            0
## 6  1992 Spain   Antigua and Barbuda         0            0

25. We have all our data ready. Combine the data so that gdp and pop data are included in one data set. [4m]

final %>% 
  left_join(world_gdp1 %>% select(year, country_name, gdp),
            by = c("Team" = "country_name", "Year" = "year")) %>% 
  left_join(world_pop1 %>% select(year, country, pop),
            by = c("Team" = "country", "Year" = "year")) %>% 
  mutate(gdp_m = gdp/1000000000,
         pop_m = pop/1000000000) -> model_data

26. It is important to examine whether the model fulfil the assumption of normality in the dependent variable. Draw a histogram of the y variable. Discuss the outcome. [1m]

ggplot(model_data,
       aes(x = total_medals)) +
       geom_histogram() 
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The y variable appears to have a lot of zero values and it is a right-skewed data.

27. Now regress the y variable with gdp_m, pop_m and homefield data. Interpret the coefficients of the model. [4m+ 3m]

linear_model <- lm(total_medals ~ gdp_m + pop_m + homefield , data = model_data) # the first argument contains a long coding. # 3m

tidy(linear_model)
## # A tibble: 4 × 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)  2.15     0.260         8.26 4.10e-16
## 2 gdp_m        0.00898  0.000401     22.4  7.21e-92
## 3 pop_m        6.69     2.06          3.25 1.20e- 3
## 4 homefield   22.9      3.71          6.17 9.38e-10

The Y-variable and the other dependent variables are corresponding to each other accordingly. In positive coefficient, Country is expected to earn more total medals, on average, by 0.0090, for each additional $1000000000 GDP. Country is expected to earn more total medals, on average, by 6.6929, for each additional 1000000000 population. Country is expected to earn more total medals, on average, by 22.9147, if it is at athletes homefield.

28. Obtain and discuss the goodness of fit of the model. [1m + 2m]

glance(linear_model)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.442         0.440  8.18      288. 9.07e-138     3 -3857. 7724. 7749.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

Roughly 44.03% of the variability in total_medals can be explained by gdp_m, pop_m and homefield. So the goodness of fit of the model is considered to be relative weak, since the explained sample in the model only below 50%.

29. Examine the residual to see whether there is any left over pattern for further investigation about model accuracy. Explain. [4m + 1m]

# combine all data
model_data_complete <- augment(linear_model) 
ggplot(data = model_data_complete,
       aes(x = .fitted,
           y = .resid)) +
  geom_point() +
  geom_hline(yintercept = 0, colour = "red")

# oR

resid_panel(linear_model, plot = "all")

I observed that the model doesn’t really have residuals around the red line as a perfect fit model would have a horizontal line at zero. Also the residuals are kind of right skewed distributed.

30. Predict the total medals won by Australia using the above model and compare the predicted value to the actual data. The information for Australia’s Gross Domestic Product in US dollar and total population size are as below. [3m = Calculation + comparison + explanation]

Australia GDP: 1,359,330,000,000
Australia population as of 2020: 25,694,393

Calculation: total_medals = 2.147314278 + 0.008981092gdp_m + 6.692875882pop_m + 22.914676098homefield

total_medals^hat = 2.147314278 + 0.008981092* 1359330000000/1000000000 + 6.692875882* 25694393/1000000000 + 22.914676098* 0 = 14.52755

Comparison: In Olympic Games Tokyo 2020, Australia won 46 medals in total.

Compare with the predicted value from created regression model, the predicted value is differ with the actual data. Actual data is 46 medals but predicted value is around 15 medals.

Explanation: The predicted value is smaller than the predicted value by more than half, which indicates this regression model is not an appropriate model to predict medal earns in Olympics. The reason for the large difference between the predicted value and the actual data may be the lack of variables is added in, there should be others dependent variables which will affect the number of medal earns.

References

The original Olympics Games data was obtained from https://www.kaggle.com/heesoo37/120-years-of-olympic-history-athletes-and-results

The population and GDP data was obtained from https://www.gapminder.org/data/